Abstract

Copy number variable (CNV) regions often have phenotypic consequences due to gene dosage effects. These consequences are thought to expose these regions to selective pressures. Previous literature has reported that the CNV locus of AMY1, a gene encoding the salivary amylase isoform, appears to have been under selection (Perry et al, 2007); it is a locus with substantial variation, with diploid copy numbers ranging from 2 to 15.

Haplotype blocks are regions of low diversity, commonly having 2-4 major haplotypes that are inherited as discrete units. This discrete nature allows them to be used to explore population structure, and they could be used to tag AMY1 CNV identity for population studies. It is therefore necessary to ascertain whether haplotype blocks and CNV at the AMY1 locus are associated.

To carry out this study we designed genotyping assays for three SNPs: rs12130703, rs12075086 and rs11185098. These segregated 370 European diploid samples into four common haplotypes and one rare haplotype; two of these haplotypes represent 80% of the population studied. We then compared these haplotypes to CNV, which was genotyped using a tetra-repeat microsatellite found in every copy of the AMY1 copy unit. The length polymorphism of this microsatellite provided an extra dimension with which to describe CNV.

We observed no significant association of CNV identity with haplotype structure at the AMY1 locus. We therefore conclude that there is no strong CNV-haplotype structure within European individuals, and that haplotype identity can be used neither to predict AMY1 CNV nor to tag its identity within the population.
Introduction

Genomes, including the human genome, are not sequences of completely independent loci but are structured, with extensive association between nearby loci, so that an allele at one position can be strongly linked with an allele at another. Linkage disequilibrium (LD) describes this non-random association of alleles at different loci (Wall and Pritchard, 2003), where alleles segregate together. This state has been described extensively (Lewontin and Kojima, 1960; Slatkin, 2008). LD is commonly measured using r² (the extent of correlation) or D' (Hedrick, 1987). This 'linkage' is different from genetic linkage, which describes the tendency of closely positioned loci on a chromosome to be inherited together during meiosis.
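As a worked illustration of the two LD measures named above, the following minimal sketch computes D, D' and r² for a pair of biallelic loci from one haplotype frequency and the two allele frequencies. The function name `ld_stats` and the example frequencies are hypothetical, chosen for illustration only, and are not data from this study.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    """
    d = p_ab - p_a * p_b
    # D' scales |D| by the largest value it could take given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical example: alleles A and B are never seen on the same haplotype
d, d_prime, r2 = ld_stats(p_ab=0.0, p_a=0.4, p_b=0.05)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r2 = {r2:.3f}")
```

In this example D' is 1 because one of the four possible haplotypes is absent, while r² stays low because the two allele frequencies differ widely. The two measures therefore answer different questions and are usually reported together.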

LD is structured into blocks of associated loci, such as single nucleotide polymorphisms (SNPs) (Daly et al, 2001), that are present across the entire human genome (Gabriel et al, 2002) and show limited diversity (Patil et al, 2001), often having 2-4 common haplotypes per block (Daly et al, 2001). These blocks are flanked by recombination hotspots (Jeffreys et al, 2001; Philips et al, 2003), so the contents of an LD block do not frequently recombine and the alleles within it are not randomised by recombination. The contents of these blocks are therefore inherited through generations with discrete identities.

In evolutionary terms these blocks evolve and mutate, creating 'branches' off root haplotypes (Underhill et al, 2000). Mapping this diversity within a population can therefore provide information on population structure and history (Cruciani et al, 2011; Kayser et al, 2001; Zegura et al, 2003).

Haplotype blocks are commonly investigated using SNPs and, because of the tightly associated nature of the SNPs within a block, it is only necessary to genotype a few SNPs in order to 'tag' the identity of the block as a whole (Johnson et al, 2001). Further analysis has shown that this tagging method does not significantly hamper the statistical power of studies investigating haplotype block diversity (Zhang et al, 2002; Zhang et al, 2004).
These haplotype blocks can also be indicative of evolutionary pressures such as a selective sweep, as well as helping to investigate population structure (Slatkin, 2009; Jakobsson et al, 2008). These infrequently recombining haplotypes could be carried along with a positively selected locus via linkage. This model is often called genetic 'hitchhiking' (Smith and Haigh, 1974). The investigation of haplotypes may therefore allow us to study regions thought to have been exposed to selective pressures.

One class of genetic variation of interest to this kind of study is copy number variation (CNV), where a genomic sequence larger than 1 kb is present in a variable number of copies (Conrad et al, 2010). CNV can range from the deletion of sequence at a single site (McCarroll et al, 2008) to tandem repeats of a unit. There are many examples of this class of polymorphism in the human genome, including the FcGR3 (Koene et al, 1998; Breunis et al, 2009), beta-defensin (Hollox et al, 2003), opsin (Nathans et al, 1986), olfactory (Rouquier et al, 1998) and salivary amylase (Groot et al, 1989) loci.

Segmental duplications are a common form of CNV (Bailey et al, 2008), and the generation of CNV can be somatic or meiotic (Bruder et al, 2008). CNV can be produced by homologous recombination and non-homologous repair mechanisms (Hastings et al, 2009). There is also evidence that retrotransposons such as the Alu transposon (which is primate specific) have a role in CNV generation (Bailey et al, 2001; Kidd et al, 2008).

There are several methods for measuring this variation. A common approach is to use a SNP array comparative genomic hybridisation (array-CGH) method to measure the relative differences between individuals and a known reference sample (Redon et al, 2006). Fluorescent in situ hybridisation (FISH) techniques also give a clear visual confirmation of such measurements (Perry et al, 2007). More recently the paralogue ratio test (PRT) (Armour et al, 2007) has been used. This method simultaneously amplifies the CNV locus and a single copy reference to provide a copy number (CN).

Several examples of CNV also demonstrate phenotypic consequences arising from their variation, such as salivary amylase (Perry et al, 2007; Mandel et al, 2010), FcGR3A (Breunis et al, 2009) and FcGR3B (Wilcocks et al, 2008). This gene dosage effect is believed to expose CNV to evolutionary pressures (Zhang et al, 2009). From these observations it is thought that CNV can be a component of long term evolution, where the duplicated genes can develop new functions ("neofunctionalisation"). Examples of such evolution can be found in the immunity genes (Hurles, 2004; Breunis et al, 2009), olfactory genes (Rouquier et al, 1998) and opsin genes (Nathans et al, 1986).

It is generally unknown whether certain copy number "identities" (such as low or high copy number) are associated with haplotype blocks. Some CNV regions (CNVRs) have been the subject of incidental analysis of the haplotypes of CNV identities in the past. One example is the FcGR3 locus, which encodes a low affinity IgG receptor. One piece of evidence for possible haplotype structure at the locus is a SNP associated with systemic lupus erythematosus (SLE), FcGR2B-I232T. This was found to be associated with a low CN of FcGR3B (Niederer et al, 2010); the causative source of this association is unknown. Another piece of evidence is a study in 2008 (Breunis et al), which found 4 'patterns' of CNV based on which of three FcGR2/3 genes were duplicated as part of a tandem.

It is unknown whether these patterns are discrete "CNV haplotypes" that are being transmitted or coincidentally similar patterns of CNV. They serve as an example of the general lack of knowledge of how CNVRs relate to the haplotype structure of the human genome.

Another CNVR with a proposed evolutionary component, but no haplotype analysis, is the AMY1 gene locus (1p21 - Entrez Gene). The AMY1 gene encodes a member of the amylase gene family: secreted enzymes that hydrolyse α-1,4-glucoside bonds in starch to produce maltose, maltotriose and larger oligosaccharides. This process starts in the mouth and is finished in the small intestine (Mandel et al, 2010), with separate salivary (AMY1) and pancreatic (AMY2A and B) isoforms (Samuelson et al, 1988). Salivary amylase is a key component of saliva, being the most abundant protein in human saliva (Oppenheim et al, 2007).
The structure of the AMY gene locus is shown in FIGURE 1A, as previously described (Samuelson et al, 1990). The AMY1 and AMY2 genes share remarkably similar sequences, showing as much as 98% sequence identity in their coding regions (Horii et al, 1987). Each AMY gene contains a primate specific γ-actin pseudogene insert (Samuelson et al, 1996), and the AMY1 specific retroviral insert into the pseudogene activates a cryptic promoter for salivary expression. Studies have shown that this insert can cause transgenic salivary expression (Meisler and Ting, 1993).

Figure 1

The AMY locus, on the short arm of chromosome 1. The positions of the AMY genes are shown (A), adapted from Samuelson et al (1990), with the copy number variable region indicated. AMY1 (salivary) genes are shown in blue, with AMY2 (pancreatic) genes shown in green. The primate specific γ-actin pseudogene insert present at all amylase genes is shown in purple, with the AMY1 specific retroviral insert (ERV) shown in orange. This retroviral insert confers salivary specific expression (Meisler and Ting, 1993) by activating a cryptic promoter site in the γ-actin pseudogene. Arrows indicate gene direction, with AMY1B being inverted and AMYP1 being a pseudogene. Also shown are the relative positions of the SNPs assayed (B), centromeric to the CNVR, within a block of linkage disequilibrium (shown by the LD plot below; red indicates strong LD).

CNV at the salivary amylase gene locus has been known for some time, being first described in 1989 (Groot et al). That study investigated the cause of previously described polymorphic salivary amylase expression patterns (Pronk et al, 1982). Restriction mapping demonstrated a CNV locus in a family pedigree, with two well described haplotypes, one with a CN of 1 and another with a CN of 7. The paper concluded that this variation was due to unequal homologous crossover, a model subsequently agreed on (Samuelson et al, 1990). Salivary amylase protein levels have also been shown to be polymorphic and to correlate with AMY1 CN (r² = 0.351), as described in 2007 (Perry et al). That study found the correlation to be highly significant (p < 0.001), although some of the variation could not be explained by CN alone. Subsequent findings have shown that this relationship affects salivary amylase activity. A study in 2010 (Mandel et al) found that higher salivary amylase levels, which their own data correlated with AMY1 CN, correlated with increased oral digestion of starch. A following study (Mandel et al, 2012) found that high AMY1 CN also correlated with an improved glycaemic homeostatic response to ingesting a starch solution. It is argued, therefore, that salivary amylase improves the anticipatory insulin response to carbohydrate ingestion by a mechanism unknown at the time of writing.

These findings suggest that AMY1 CNV can be influenced by selective pressures. This argument is further supported by additional findings from Perry et al (2007), who described how human populations with historically low starch consumption (classically hunter-gatherer societies) and those with historically high starch consumption (classically agricultural societies) varied in average CN. They reported that the low starch populations carried a low diploid CN (~4-5) whereas high starch populations carried a high diploid CN (~6-9). These observations build a model describing the variety of AMY1 CNV: increased starch consumption has selected for higher CNs, as a high CN increases the capacity for starch breakdown (Mandel et al, 2010) and improves the homeostatic response to starch ingestion (Mandel et al, 2012). Varying population needs for starch intake would produce an overall variation in CN.

This model makes the salivary amylase CNVR an interesting target for evolutionary and population study. One way to investigate this is to ask whether CN identities are associated with the surrounding haplotype structure. If so, discrete patterns of CNV could be tracked by studying a population's haplotype structure.

This study therefore aims to genotype SNPs (and thereby capture haplotype block diversity) in several hundred individuals whose CNV at the AMY1 locus has also been genotyped. This CNV will have been genotyped previously, primarily by genotyping a microsatellite found at the locus. We will then compare the two data sets in each individual and test statistically whether any association between CNV and haplotype diversity can be found.

Methods

rs12130703 and rs12075086 PCR

This PCR amplified a product containing both the rs12130703 and rs12075086 sites so that they could later be digested (by BsrI and MnlI respectively), and was carried out in LD buffer. To a 20 µl total: 10 ng/µl DNA, 10 µM forward and reverse primers, 5 U/µl Taq polymerase (NEB), 50 µM Tris-HCl pH 8.8, 12.5 mM ammonium sulphate, 1.4 mM MgCl2, 7.5 mM 2-mercaptoethanol, 200 µM dNTPs and 125 µg/ml BSA. This produces a 1070 bp PCR product. It was amplified for 37 cycles, each cycle being 95° for 30 seconds, 56.8° for 30 seconds and 64.4° for 4 minutes, with a final extension step of 70° for 2 minutes. The forward primer was 5'GGTACAAACGAAAACACTGACA3' and the reverse primer was 5'TTGCTTACAATGCTTGACTTC3'. To develop this assay an individual of known genotype, ECACC individual 39, was used.

rs11185098 PCR

This PCR was used to amplify the region containing rs11185098 so that it could later be digested by Fnu4HI. The PCR was carried out in x10 buffer. To a 20 µl total: 10 ng/µl DNA, 10 µM forward and reverse primers, 5 U/µl Taq polymerase (NEB), 50 µM Tris-HCl pH 8.8,




CNV and Haplotype Association within the AMY1 locus. 
Edmund Gilbert 

12 mM ammonium sulphate, 5 mM MgCl2, 7.4 mM 2-mercaptoethanol, 1.1 mM dNTPs and 125 µg/ml BSA. This produced a 1400 bp product. Compared to the LD buffer, this buffer solution is enriched in dNTPs and MgCl2 so that reactions which had produced a low yield produced enough product for a restriction assay. It was amplified with an initial denaturing step of 95° for 30 seconds followed by 37 cycles of 95° for 30 seconds, 55.6° for 30 seconds and 65° for 3 minutes. The forward primer was 5'TTCAAAAGATCCCCTTCCTT3' and the reverse primer was 5'ATTTTGGTGGCATTTTTGGA3'. To develop this assay ECACC individual 39 was used.

Microsatellite PCR

This PCR was used to amplify the tetra-repeat microsatellite believed to co-segregate with AMY1. It amplifies from the flanks of the repeat and therefore yields a product whose size varies with the length of the amplified microsatellite. It was carried out in LD buffer, in a 10 µl total: 10 ng/µl DNA, 10 µM forward and reverse primers, 5 U/µl Taq polymerase (NEB), 50 µM Tris-HCl pH 8.8, 12.5 mM ammonium sulphate, 1.4 mM MgCl2, 7.5 mM 2-mercaptoethanol, 200 µM dNTPs and 125 µg/ml BSA. It was amplified with an initial denaturing step of 95° for 5 minutes. This was followed by 24 cycles of 95° for seconds, 53.6° for 30 seconds and 70° for 1 minute, followed by a final extension step of 72° for 40 minutes. This final step was to make certain that the Taq added the A/T overhang to all products, as the microsatellite electrophoresis was sensitive to single base pair differences in length. The forward primer was FAM-5'ATTATCCTTTCACAGACAAAAG3', where FAM is a fluorescein derivative that makes the primer fluorescent, and the reverse primer was 5'TCCTCTAGGGTCATTCATTT3'.

Agarose Gel Electrophoresis

Gel electrophoresis was used to check the success of the rs12130703, rs12075086 and rs11185098 PCRs, as well as to visualise the RFLP assays. Samples were run in a 2% agarose gel in 0.5x TBE buffer with 0.5 µg/ml ethidium bromide. A 100 bp ladder was used for the BsrI/MnlI gel runs and a 1 kb ladder for the Fnu4HI gel runs, because of the sizes of the PCR and RFLP products. Gels were run at 120 V.
Restriction Fragment Length Polymorphism Assay

BsrI assay: for a 15 µl total, 5 µl of BsrI PCR product, 1x NEB buffer 3 and 2.5 U of BsrI, incubated overnight at 65°. MnlI assay: for a 15 µl total, 6 µl of MnlI PCR product, 1x NEB buffer 4 and 2.5 U of MnlI, incubated overnight at 37°. Fnu4HI assay: for a 15 µl total, 5 µl of Fnu4HI PCR product, 1x NEB buffer 4 and 1.5 U of Fnu4HI, incubated overnight at 37°. NEB buffer 3 contains 100 mM NaCl, 50 mM Tris-HCl, 10 mM MgCl2 and 1 mM dithiothreitol, pH 7.9 at 25°. NEB buffer 4 contains 50 mM potassium acetate, 20 mM Tris-acetate, 10 mM magnesium acetate and 1 mM dithiothreitol, pH 7.9 at 25°.

To develop the BsrI and MnlI assays the following HapMap CEPH (Centre d'Etude du Polymorphisme Humain) plate 01 individuals were used as known genotype controls: NA12144, NA12717, NA12264, NA07357, NA11831 and NA11830. To develop the Fnu4HI assay, individuals NA12044, NA12815, NA12043, NA12892 and NA12873 from HapMap CEPH plate 01 were used.

Microsatellite Capillary Electrophoresis

For every 16 samples, 10 µl was added to each sample from an initial mix of 2 µl Rox molecular marker and 170 µl HiDi formamide. 2 µl of the microsatellite PCR product was then added to this. It was denatured at 95° for 3 minutes, then run on an ABI 3130xl (Applied Biosystems), with 1 kV for 30 seconds.

Results

To assess the viability of investigating whether haplotypes can be used to predict copy number identity, preliminary work was first carried out on existing data from individuals of the International HapMap Project, CEPH plate 01. These individuals are well described in terms of SNPs, allowing haplotypes to be inferred from tagging SNPs.

Individuals from the HapMap project were useful because their genotypes are phased, that is, described as haploid genotypes. This is a result of their being part of well-defined pedigrees. Following the transmission of haplotypes through these families allows the segregation of diploid genotypes into haplotypes. This "segregation analysis" allowed us to compare the CNV haplotype directly to the microsatellite haplotypes, which increases the statistical power of the preliminary work.

These "HapMap" individuals had previously had their copy number genotyped by a combination of a microsatellite genotype assay and AMY1 CNV PRT and non-PRT tests. This "microsatellite assay" is a PCR amplification of all copies of the microsatellite, which are then separated by size via capillary electrophoresis, yielding a genotype based on the length of each repeat and the number of repeats. This assay was carried out by Sugandha Dhar, a PhD student within the lab.

Therefore, in order to compare haplotype to CNV, only the haplotype diversity needed to be captured. Consequently, the SNP genotypes of these CEPH individuals were investigated using the HapMap genome browser, release #24. Analysis of the SNP diversity at adjacent (centromeric) LD blocks (chr1:104110000-104130000) yielded three SNPs, rs12130703, rs12075086 and rs1999478; see FIGURE 1B for positions. These segregated the data into a reasonable number of haplotypes with which to compare CNV. These SNPs would therefore tag the representative haplotypes within the European population samples used, as most haplotype blocks have 2-4 main haplotypes present within the population (Daly et al, 2001). The three chosen SNPs segregated individuals into 5 distinct haplotypes, designated 1, 2, 3, 4 and 5.

After sorting individuals into their component haplotypes, associations with CNV were investigated. In the preliminary data, which had a sample size of 46, these SNP haplotypes did not appear to have any significant association with total CN. However, there did appear to be an association between haplotypes 3 and 5 and an enrichment of short microsatellites (lengths <265 bp) (Chi-squared value = 7.30, p = 0.007, n = 46). This suggested that any association between haplotype structure and CNV identity would be found in microsatellite length variation, not total CN. A larger data set was needed to confirm these hypotheses and to provide more analytical power. The larger data set would also be used to confirm the lack of association with CN, which was particularly surprising.
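The reported p value can be checked against the chi-squared statistic. The sketch below assumes the test had one degree of freedom (the text does not state this), in which case the upper-tail probability has a closed form using the complementary error function; `chi2_sf_1df` is a hypothetical helper name.

```python
import math


def chi2_sf_1df(chi2):
    """Upper-tail p value of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Statistic quoted for the preliminary short-microsatellite comparison
p = chi2_sf_1df(7.30)
print(f"p = {p:.4f}")
```

Under that assumption the computed value agrees with the reported p = 0.007 to the precision quoted.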
The three SNPs (rs12130703, rs12075086 and rs1999478) were then genotyped in ECACC individuals. These are "European Collection of Cell Cultures" individuals and were an independent data set from the HapMap individuals, allowing confirmation of the preliminary observations. These individuals had already been genotyped with respect to the AMY1 microsatellite by Sugandha Dhar, so that only the SNP genotyping needed to be carried out.

To genotype the SNPs, restriction fragment length polymorphism (RFLP) assays were developed for the three SNPs chosen. The rs12130703 and rs12075086 polymorphisms were found to create or destroy a BsrI and an MnlI site, respectively. The rs1999478 polymorphism did not create a restriction site (sequence: CCCCnAAAA, where n is rs1999478, a C/A polymorphism) and the surrounding sequence made an allele specific PCR assay unsuitable for high-throughput genotyping. Therefore rs1999478 was abandoned as a testable SNP.



This meant, however, that haplotypes 1 and 3, and haplotypes 2 and 4, could not be distinguished, as the identity of rs1999478 was what separated 1 from 3 and 2 from 4. Therefore another SNP was sought that would either "tag" the segregation of rs1999478 or segregate the individuals in a similar manner, keeping as much of the original haplotypes as possible.

A new SNP, rs11185098, was found, preserving the old haplotypes with the exception of 1 and 3. Under segregation by rs11185098, haplotype 3 gains some individuals that previously belonged to haplotype 1, the remainder forming the new haplotype 1. To genotype rs11185098 the enzyme Fnu4HI is used in an RFLP assay, as for rs12130703 and rs12075086. FIGURE 2 shows representative digestions of the BsrI, MnlI and Fnu4HI assays.

TABLE 1 shows the preliminary data, split by the SNPs rs12130703, rs12075086 and rs11185098.
Figure 2

Representation of digestion patterns of rs12130703 digested by BsrI (A), rs12075086 digested by MnlI (B) and rs11185098 digested by Fnu4HI (C). Select bands identify the alleles that the individual carries (D). Sample genotypes are also shown. All three gels represent different individuals.



Table 1

The results of the preliminary investigation. Shown are the HapMap individuals tested; their haplotype as defined by rs12130703, rs12075086 and rs11185098 as well as their microsatellite (MS) profile, and total copy number (CN). The individuals are defined as their

This phase of genotyping produced diploid genotypes. Because the ECACC individuals are unrelated, the SNP haplotype and CNV haplotype could not easily be phased (unlike in the case of the HapMap individuals).

The diploid nature of the ECACC data did not make phasing of the SNPs impossible, however. The known association of SNP with SNP allowed certain assumptions to be made in order to phase the diploid genotype (Clark, 1990). For example, knowing that allele rs12130703 A is only associated with haplotype 5 allows AA and AG genotypes to be defined immediately as 5|5 and 5|any individuals respectively. The diploid SNP genotypes that correspond to the diploid combinations of the haplotypes are shown in TABLE 2. SNP genotype and CNV genotype could not be phased with each other, however. This meant that the analysis would have to take into account the homozygous and heterozygous states seen in a diploid population. The Cochran-Armitage Test (Cochran, 1954; Armitage, 1955) can test such a system and rescue some of the power lost by analysing diploid data. This test is a variant of the Chi-Squared Test that tests for an association that is 'dose' dependent, and can be used in genetics to test for differences between homozygotes (++), heterozygotes (+-) and homozygotes (--).
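A minimal sketch of the Cochran-Armitage trend test for a 2 x 3 table of genotype counts follows. It uses the standard textbook form of the statistic with additive weights (0, 1, 2); the function name, weights and example counts are assumptions for illustration and are not taken from this study's data or analysis.

```python
import math


def cochran_armitage_z(cases, controls, weights=(0, 1, 2)):
    """Z statistic for a linear trend in proportions across ordered groups.

    cases, controls: counts per genotype class, ordered e.g. (--, +-, ++)
    weights: score for each class; (0, 1, 2) models an additive 'dose'
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    n_controls = n - n_cases
    weighted_cases = sum(t * r for t, r in zip(weights, cases))
    weighted_totals = sum(t * m for t, m in zip(weights, totals))
    # Trend statistic: N * sum(t_i * r_i) - R * sum(t_i * n_i)
    t_stat = n * weighted_cases - n_cases * weighted_totals
    # Variance of the trend statistic under the null hypothesis of no trend
    var = (n_cases * n_controls / n) * (
        n * sum(t * t * m for t, m in zip(weights, totals))
        - weighted_totals ** 2
    )
    return t_stat / math.sqrt(var)


# Hypothetical counts for the three genotype classes
z = cochran_armitage_z(cases=(10, 20, 30), controls=(30, 20, 10))
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
print(f"Z = {z:.3f}, two-sided p = {p_two_sided:.2e}")
```

Because the weights encode the number of copies of the '+' allele, the test concentrates its power on a monotonic dose effect rather than on arbitrary differences between the three classes.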



The direct output of the microsatellite assay gives only the ratios of copy number between microsatellite lengths (see FIGURE 3), so the majority of analysis was based on the presence or absence of microsatellite lengths, not total copy number. This was not deemed a significant loss of data, as the preliminary data presented no clear evidence of association of SNP haplotype with total CN.



Table 2

The diploid SNP genotypes of all the permutations of the 5 haplotypes identified, allowing the genotyping of individuals based on their SNP diploid genotypes.

Diplotype   rs12130703   rs12075086   rs11185098
1|1         GG           CC           AA
1|2         GG           CT           AA
1|3         GG           CC           AG
1|4         GG           CT           AG
1|5         AG           CC           AG
2|2         GG           TT           AA
2|3         GG           CT           AG
2|4         GG           TT           AG
2|5         AG           CT           AG
3|3         GG           CC           GG
3|4         GG           CT           GG
3|5         AG           CC           GG
4|4         GG           TT           GG
4|5         AG           CT           GG
5|5         AA           CC           GG
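The diplotype genotypes in TABLE 2 follow mechanically from the allele content of the five haplotypes. The sketch below, with hypothetical names, takes the haplotype definitions given in TABLE 3, enumerates every unordered pair, and groups the pairs by their unphased three-SNP genotype so that any genotype shared by more than one diplotype is flagged.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

# Alleles at rs12130703, rs12075086, rs11185098 (haplotype definitions from TABLE 3)
HAPLOTYPES = {1: "GCA", 2: "GTA", 3: "GCG", 4: "GTG", 5: "ACG"}


def diplotype_genotype(h1, h2):
    """Unphased genotype per SNP, alleles sorted alphabetically (e.g. 'AG')."""
    return tuple("".join(sorted(a + b))
                 for a, b in zip(HAPLOTYPES[h1], HAPLOTYPES[h2]))


# Group the 15 possible diplotypes by the genotype an RFLP assay would report
by_genotype = defaultdict(list)
for h1, h2 in combinations_with_replacement(sorted(HAPLOTYPES), 2):
    by_genotype[diplotype_genotype(h1, h2)].append(f"{h1}|{h2}")

for genotype, diplotypes in by_genotype.items():
    flag = "  <- ambiguous" if len(diplotypes) > 1 else ""
    print(" ".join(genotype), ",".join(diplotypes) + flag)
```

With these definitions only the genotype GG CT AG is shared, by diplotypes 1|4 and 2|3. Given the very low frequency of haplotype 4 in TABLE 3 (0.001), 2|3 would be the more probable assignment under random mating, although this inference is not stated in the text.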
Figure 3

Representative microsatellite trace, showing relative microsatellite intensities. Higher relative peak height (y axis) indicates a higher copy number at that microsatellite size (x axis) relative to lower peaks. Therefore (A) shows a 1:2:2:3 ratio (with respect to the sizes 257, 261, 265 and 269 bp). This corresponds to either an 8 copy individual or a 16 copy individual (which would produce the same ratio pattern); however, the individual is most likely an 8 copy, as 16 copy individuals are rarer (Perry et al, 2007). The red peaks shown are molecular markers, of a constant intensity of 400 and at size intervals of 50 bp. Data shown is my own, not from Sugandha Dhar's wider genotyping work.
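The reasoning in the Figure 3 caption, that a 1:2:2:3 peak ratio implies 8 copies or a multiple of 8, can be sketched as follows. The peak heights, tolerance, upper copy number limit and function names are all hypothetical, and real traces would need calibration that this sketch ignores.

```python
def minimum_ratio(peak_heights, max_unit=4, tol=0.15):
    """Smallest whole-number ratio consistent with the relative peak heights.

    Tries treating the smallest peak as 1, 2, ... max_unit copies and returns
    the first scaling at which every peak is within tol of an integer.
    """
    smallest = min(peak_heights)
    for unit in range(1, max_unit + 1):
        scaled = [unit * h / smallest for h in peak_heights]
        if all(abs(s - round(s)) <= tol for s in scaled):
            return [round(s) for s in scaled]
    raise ValueError("no whole-number ratio found within tolerance")


def candidate_copy_numbers(ratio, max_copies=16):
    """Diploid copy numbers compatible with a minimum ratio (its multiples)."""
    base = sum(ratio)
    return list(range(base, max_copies + 1, base))


# Hypothetical peak heights at 257, 261, 265 and 269 bp
ratio = minimum_ratio([100, 205, 198, 301])
print(ratio, candidate_copy_numbers(ratio))
```

The ratio alone cannot distinguish between the candidates, which is why the caption falls back on the relative rarity of high copy numbers to choose between them.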

From this stage of the experiment 450 individuals were initially genotyped with respect to the SNPs rs12130703, rs12075086 and rs11185098. Because some PCR/RFLP reactions failed and not all individuals could be retyped, this study uses the 370 individuals whose genotypes could be defined. This data set forms the basis of the following analyses, and a summary of the SNP identity and frequency of the haplotypes is shown in TABLE 3.



Table 3

The five haplotypes described in this study, and the SNP identity that tags the haplotypes. Also shown are the frequencies of the 5 haplotypes within the ECACC individuals tested.

Haplotype   rs12130703   rs12075086   rs11185098   Frequency
1           G            C            A            0.137
2           G            T            A            0.052
3           G            C            G            0.397
4           G            T            G            0.001
5           A            C            G            0.414

With the data set gathered, and before testing the association hypotheses, it was important to ascertain whether the allele calling and the assumptions of SNP association produced a dataset that could be trusted. The genotype frequencies of each SNP were tested against the genotype frequencies expected from the allele frequencies under Hardy-Weinberg Equilibrium (HWE). The BsrI and MnlI assays presented no significant departure of observed from expected frequencies (Chi squared p = 1.000 and p = 0.979 respectively). The Fnu4HI assay, however, presented a significant departure from the expected (Chi squared p = 0.04). This resulted from an under-calling of the AA genotype, and the problem persisted after independent corroboration of genotypes. We do not believe that this under-calling is a result of systematic error; however, we cannot account for the lack of AA individuals. The association between the three SNPs genotyped was also checked, and the results are shown in TABLE 4. The D' value of 1 indicates that the SNPs are in strong association, in that not all possible allele combinations are observed. The r² values indicate association that is not absolute, which is to be expected as the three SNPs vary enough to tag the partitioning of the haplotypes.
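The Hardy-Weinberg check described above can be sketched as follows for a single biallelic SNP. The genotype counts are hypothetical and are not this study's data; the function name is likewise an assumption, and the p value uses the closed form for a chi-squared distribution with one degree of freedom.

```python
import math


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg proportions for one biallelic SNP.

    Returns (chi2, p) with 1 degree of freedom
    (3 genotype classes - 1 - 1 estimated allele frequency).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical genotype counts showing a deficit of heterozygotes
chi2, p_value = hwe_chi2(30, 40, 30)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

A deficit or excess in one genotype class, such as the under-called AA class described for the Fnu4HI assay, inflates the statistic because expected counts are derived from the observed allele frequencies.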

Table 4
The results of testing the association strength and confidence
of the three SNPs genotyped within the study.

SNPs compared               D' value   r² value   p value
rs12130703 / rs12075086     1          0.060      1.56E-11
rs12075086 / rs11185098     1          0.38       2.21E-66
rs12130703 / rs11185098     1          0.16       3.22E-28

Within the preliminary data it was observed that haplotype 3 contained a null 269
microsatellite length allele that was associated with no microsatellite lengths >269. This
prompted the hypothesis that this was indicative of a subpopulation of haplotype 3; if so,
we would expect a bias within haplotype 3 towards more null 269 alleles. Therefore, we
investigated whether there was any significant difference between the number of null 269
alleles in haplotype 3 and in the other haplotypes in the larger data set. However, no
significant difference was found (Z = 1.141, p = 0.127, n = 376). The observation that
haplotypes 3 and 5 were associated with short alleles (lengths of 257 and 261) proved to be
insignificant within the large data set (Z = 0.724, p = 0.235, n = 376), as did the same
relationship for haplotypes 3 and 5 alone (Z = 0.326, p = 0.372, n = 376 and Z = 0.929,
p = 0.177, n = 376, respectively).
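
The p values quoted with these Z scores are consistent with a one-tailed normal approximation, p = 1 - Φ(Z). The text does not state which tail was used, so this convention is an assumption; the helper name `one_tailed_p` is hypothetical. Under that assumption, a minimal sketch recomputes the p values from the reported Z scores:

```python
import math

def one_tailed_p(z: float) -> float:
    """Upper-tail probability of a standard normal variable: P(X >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Z scores reported in the text, paired with the p values reported beside them.
reported = [(1.141, 0.127), (0.724, 0.235), (0.326, 0.372), (0.929, 0.177)]

for z, p_reported in reported:
    print(f"Z = {z:.3f}: computed p = {one_tailed_p(z):.3f}, reported p = {p_reported}")
```

The same helper applied to Z = 1.67 gives a value just under 0.05, in line with the borderline result reported for haplotype 3.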

The ratio nature of the microsatellite data meant that the true integer CN value for the
ECACC individuals could not be obtained. However, we could round up the ratio values to
integers in order to produce CN values with which to test. We used a Cochran-Armitage test,
defining case/control as those less than or equal to, or greater than, the total population
median (6). Haplotype 1 (Z = 1.28), haplotype 2 (Z = 0.40) and haplotype 5 (Z = 1.06)
showed no significant association (all p > 0.1, n = 370). Haplotype 3 appeared to have a
significant association (Z = 1.67, p = 0.05, n = 370), but further analysis demonstrated
that this was not truly significant.
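
A Cochran-Armitage trend test of this kind can be sketched as follows. This is a minimal illustration, not the analysis code used in the study. It assumes that individuals are grouped by the number of copies (0, 1 or 2) of a given haplotype, and that a 'case' is an individual whose rounded CN exceeds the median. The function name and all counts are hypothetical.

```python
import math

def cochran_armitage_z(cases, totals, scores=(0, 1, 2)):
    """Z statistic for a linear trend in the proportion of cases across ordered groups.

    cases[i]  -- individuals in group i with CN above the median ("cases")
    totals[i] -- all individuals in group i
    scores[i] -- ordinal score of group i (here: copies of the haplotype carried)
    """
    n = sum(totals)
    p_bar = sum(cases) / n  # overall proportion of cases
    # Numerator: score-weighted excess of cases over the no-trend expectation.
    t = sum(s * (r - m * p_bar) for s, r, m in zip(scores, cases, totals))
    # Variance of the numerator under the null hypothesis of no trend.
    sum_s = sum(m * s for s, m in zip(scores, totals))
    sum_ss = sum(m * s * s for s, m in zip(scores, totals))
    var = p_bar * (1 - p_bar) * (sum_ss - sum_s ** 2 / n)
    return t / math.sqrt(var)

# Hypothetical counts for carriers of 0, 1 and 2 copies of one haplotype.
totals = [200, 140, 30]
cases = [90, 70, 18]
z = cochran_armitage_z(cases, totals)
p = 0.5 * math.erfc(z / math.sqrt(2))  # one-tailed p value (assumed convention)
print(f"Z = {z:.2f}, one-tailed p = {p:.3f}")
```

When the proportion of cases is identical in every group the numerator is zero and so is Z, which is the behaviour expected when haplotype dosage carries no information about CN.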

Discussion

This study has found that there does not seem to be any significantly strong correlation
between haplotype structure and CNV within the AMY1 locus. This is initially surprising,
given a literature background that suggests that the locus has had a significant
evolutionary history within the human population (Perry et al, 2007; Mandel et al, 2010 and
2012).

The relatively small sample size of 370 means that we would not have the power to observe
any subtle associations that may exist. The lack of strong association suggests two
possibilities. Firstly, the sequence between the SNPs and the CNVR may be recombining, so
that the two are not linked. Secondly, the generation of new CNV identities may be faster
than the generation of new SNP haplotypes, disrupting any association. Considering the
extent of high LD around the CNVR when the region is viewed in the HapMap genome browser,
we believe that the latter explanation is more likely.

The microsatellite information produces minimum ratios (see FIGURE 3), not total copy
number. Nevertheless, we could still use these data to investigate CN, as the minimum ratio
(and the resultant total CN) is still based on the 'true' CN. One issue is that of
multiples: a 1:2:1 ratio could be produced by a 4 copy individual, or by an 8 or 12 copy
individual. This 'rounding down' to the minimum CN gives the data set a systematic bias
towards calling such individuals as lower CN than they would otherwise be called. Therefore,
although the called CN would broadly reflect the population's CN distribution, the observed
average CN will be lower than it otherwise would be. This means that the significant
association with haplotype 3 that we observed (Z = 1.67, p = 0.05, n = 370) is most likely
insignificant if we imagine the removal of the bias towards lower CN. Work using non-PRT
and PRT tests in conjunction with the microsatellite data to produce a true CN profile for
each individual, via a maximum-likelihood method, has yet to be completed within the
laboratory by Sugandha Dhar. For this reason, microsatellite ratios had to be used.
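
The multiples issue can be made concrete with a short sketch. It assumes the diploid total range of 2 to 15 copies quoted in the abstract as bounds on plausible CN; the function name is hypothetical.

```python
def candidate_totals(min_ratio, lo=2, hi=15):
    """Diploid copy numbers consistent with a minimum integer ratio.

    min_ratio -- microsatellite peak ratio in lowest terms, e.g. (1, 2, 1)
    lo, hi    -- assumed bounds on the diploid total copy number
    """
    base = sum(min_ratio)  # copy number implied by the minimum ratio itself
    return [k * base for k in range(1, hi // base + 1) if k * base >= lo]

print(candidate_totals((1, 2, 1)))  # [4, 8, 12]
```

Always calling the first entry of such a list is what produces the downward bias described above.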

Another issue with using the rounded microsatellite ratios is the idea of a 'minimum ratio
value'. If a ratio gives 1:2:3 we can be quite confident that the minimum ratio value is 1.
If the ratios instead gave 1:1.5:1.5, we might round up to 1:2:2 to get a total of 5 and a
minimum ratio value of 1. However, the ratio pattern 1:1.5:1.5 could easily represent a
2:3:3 ratio (and now an 8 copy individual), the conversion to a ratio having reduced this
to a minimum ratio value of 1. Therefore, we analysed whether any individuals had ratio
values around n.5 at one or more lengths; 15 such individuals were identified. These
individuals' genotypes were then converted into CNV identities that better represented the
implications of the microsatellite ratios, and were incorporated into the final analyses.
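
The difference between rounding a ratio pattern and rescaling it can be illustrated as below. This is a sketch under the assumption that measured ratios near n.5 reflect true half-steps; the function name is hypothetical and this is not necessarily the procedure used to recode the 15 individuals.

```python
import math
from fractions import Fraction
from functools import reduce

def smallest_integer_ratio(ratios):
    """Smallest all-integer pattern that a ratio pattern could represent.

    Each value is first snapped to the nearest half, then the pattern is
    scaled up until every entry is a whole number and reduced to lowest terms.
    """
    halves = [Fraction(round(2 * r), 2) for r in ratios]
    scale = max(h.denominator for h in halves)  # 1 if all whole, 2 if any n.5
    ints = [int(h * scale) for h in halves]
    g = reduce(math.gcd, ints)
    return [v // g for v in ints]

pattern = (1, 1.5, 1.5)
rounded = [math.floor(r + 0.5) for r in pattern]  # round halves up
rescaled = smallest_integer_ratio(pattern)
print(rounded, sum(rounded))    # [1, 2, 2] 5
print(rescaled, sum(rescaled))  # [2, 3, 3] 8
```

A pattern with no half-steps, such as 1:2:3, is returned unchanged, so only the n.5 individuals are affected by the rescaling.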

We have also assumed that every copy of the AMY1 gene is completely associated with 1 copy
of the microsatellite. Unpublished data suggest that this is the case, with no known
exceptions. If this were not the case, however, the CN would vary only slightly, and this
would therefore not significantly impact the study.

The haplotypes studied are not equally frequent within the population. The most frequent
haplotypes were 3 and 5, which were almost equally prominent (see TABLE 3); together they
represent 80% of the population. These frequencies support previous literature on the
limited diversity within haplotype blocks (Daly et al, 2001; Patil et al, 2001), increasing
confidence that we have captured the representative haplotype diversity at this locus.

The method of calling the diploid haplotype genotypes allowed the majority of the expected
SNP diploid genotypes to be defined, but did assume that no new haplotypes were present. We
did not observe any genotypes that could not be explained by TABLE 2; the only anomalous
genotype (data not shown) was found to be a genotyping error. Therefore the rate of false
positives is believed to be insignificant.

The only ambiguous diploid genotype was GG/CT/AG (rs12130703, rs12075086 and rs11185098
respectively), for which the 1|4 and 2|3 diploid genotypes were indistinguishable. However,
these individuals were all used in later analysis as 2|3 individuals. This was because the
rarity of haplotype 4 (observed only once in this study) meant that, of the total 35
ambiguous individuals, no haplotype 4 would reasonably be expected. Whilst haplotype 4's
rarity does not contribute to the study's focus on association, it does help increase
confidence in the accuracy of our capture of haplotype diversity, in that we are observing
the rare haplotypes expected in a population study.

The SNPs that were used to tag the haplotype structure at the AMY1 locus were all
significantly associated with each other (see TABLE 4). The p values for the D' and r²
values are all extremely significant and give us confidence that these SNPs tag LD
structure around the AMY1 locus. The D' (Hedrick, 1987) values for all the pairs are 1.
Given that not all of the four possible genotypes have been observed, this is an
unsurprising value. The r² values are somewhat lower than the D' values; this was also
expected, because each pair is not associated with only one genotype (which would return an
r² value of 1). The variation at these loci that produces these r² values was the reason
these SNPs were chosen, so that they could tag multiple haplotypes.
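
The relationship between D' and r² described here can be checked with a small calculation. The sketch below uses the standard definitions of D, D' and r² for two biallelic SNPs. The counts are hypothetical and chosen only so that one of the four two-SNP allele combinations is absent; they are not the study's data, and the function name is an assumption.

```python
def ld_stats(n_ab, n_aB, n_Ab, n_AB):
    """D' and r^2 for two biallelic SNPs from counts of the four allele combinations."""
    n = n_ab + n_aB + n_Ab + n_AB
    p_ab = n_ab / n
    p_a = (n_ab + n_aB) / n  # frequency of allele a at the first SNP
    p_b = (n_ab + n_Ab) / n  # frequency of allele b at the second SNP
    d = p_ab - p_a * p_b     # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2

# Hypothetical chromosome counts with the a-b combination never observed.
d_prime, r2 = ld_stats(0, 100, 200, 440)
print(f"D' = {d_prime:.2f}, r2 = {r2:.3f}")
```

With one combination missing, D' reaches 1 whatever the allele frequencies, whereas r² reaches 1 only when two of the four combinations are absent, which is why a pair can show D' = 1 alongside a low r².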

This study also concentrated on European samples, both in the preliminary data, which used
samples from CEPH, and in testing a wider population within the ECACC individuals. The
European sample group was felt to provide enough variation to investigate whether there was
any general association between haplotypes and CNV. However, we would expect to observe
other haplotypes in other populations, such as Asian or African populations. In the latter
we would expect a greater extent of variation (Frisse et al, 2001), so African samples were
not chosen for such a preliminary study, as the greater variability could obscure any
association.
The expectation that the European samples represent a subdivision within the greater human
diversity is supported by the wider SNP diversity within the haplotypes tested. The general
SNP background of the haplotypes indicates that haplotype 5 is part of an outlier group of
haplotypes, with the 5/other split represented by a clear division in general SNP
diversity. This segregation can be defined (within the haplotypes studied) by the
possession of an A allele at rs12130703. Cochran-Armitage analysis of CN between haplotype
5 and the rest also resulted in a p value >0.05. Perhaps investigation within other
populations would reveal other, intermediate haplotype blocks.

It is entirely reasonable that, given a strong selective force for certain AMY1 CNs in
recent time and/or in a small isolated population, an association between haplotype and CNV
could be observed. It would be very interesting to see whether any populations show such
strong evidence for recent selection at AMY1, considering the previous literature (Perry et
al, 2007; Mandel et al, 2010 and 2012) and the lack of association found in this study.

We have seen only the 4 major haplotypes of the locus (1, 2, 3 and 5), with haplotype 4
being a rare haplotype. We did not observe any haplotypes other than those observed in the
preliminary stage of the investigation. Therefore the 4 major haplotypes we see in this
large data set represent the majority of European haplotypes. We would expect that in a
larger scale investigation more rare haplotypes, such as haplotype 4, would be observed.
However, we would not expect these haplotypes to be relevant to the question of whether
AMY1 CNV is associated with haplotype structure. Considering the lack of association with
the major haplotypes of the locus, the rare haplotypes would not be expected to show any
association, and their rarity would present analytical difficulty.

The four SNPs that have been discussed (rs12130703, rs12075086, rs11185098 and the
abandoned rs1999478) are not the only SNPs at the loci studied, that is, the centromeric
and telomeric adjacent LD blocks. Most of the SNPs studied in European samples at the
centromeric locus segregate at the major two haplotype split, which is tagged by
rs12130703. A handful of SNPs were observed to vary enough to segregate the haplotypes
into more defined partitions; the SNPs investigated (such as rs12130703, rs12075086,
rs1999478 and rs11185098) were these SNPs. Also, the aim of this study was to compare the
general haplotype structure with AMY1 CNV and to observe whether there was any general
association, not to provide a lineage map of the adjacent haplotypes. Considering that the
centromeric haplotype block has no association with the CNV, it is unlikely that the
telomeric haplotypes would be associated with CNV, so the SNPs within that block do not
need to be considered for the initial hypothesis of this study.

The SNP allele frequencies are in sufficient agreement with the HapMap frequencies (all
p > 0.9) for us to be confident that the allele, genotype and haplotype frequencies are a
true representation. The sample sizes of the HapMap-derived allele frequencies cluster
around 100 individuals, so this larger data set may be expected not to match perfectly
because of random sampling variation.

We also tested whether the observed genotype frequencies were significantly different from
the expected genotype frequencies derived from the HWE (p² + 2pq + q²), that is, whether
there was a departure from the HWE. The observed versus expected comparisons for
rs12130703 and rs12075086 were not significant (p > 0.95); however, there was a significant
departure in the case of rs11185098 (p = 0.04). This significance is removed if the number
of AA genotype calls is increased by about 7. It seems, therefore, that the genotype AA has
been under-called in the study, and one explanation is that the calling process somehow had
a systematic bias towards calling AA as another genotype. However, when the entire data set
was re-called independently of the original caller, this problem still arose. Following
this analysis we are confident that the genotypes called at the rs11185098 position are
correct; however, we have not been able to account for this under-representation of the AA
genotype.
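
A chi-squared test against the HWE expectation can be sketched as follows. The genotype counts are hypothetical, chosen only to show a deficit of AA calls; the function name is an assumption, and the study's own counts are not reproduced here.

```python
import math

def hwe_chi2_test(n_aa, n_ab, n_bb):
    """Chi-squared test (1 degree of freedom) of genotype counts against HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)  # p^2, 2pq, q^2 scaled to n
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-squared, 1 df
    return chi2, p_value

# Hypothetical counts of AA, AG and GG individuals with fewer AA than expected.
chi2, p_value = hwe_chi2_test(12, 140, 218)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
```

Re-running the function with a few more AA individuals shows how sensitive the result is to a small number of calls in the rarest genotype class.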

In conclusion, this study has found no significant association of CNV with haplotype
structure; we believe that this is because the generation of CNV at the AMY1 locus is
faster than the generation of new haplotypes. This would therefore disrupt any association
that appeared historically. We believe that further research into the extent of this lack
of association, such as in isolated populations, can shed light on the evolutionary
structure of copy number variation at the AMY1 locus.